[PERF] Optimize LingBot-World-Fast chunk profiling and runtime paths by yJader · Pull Request #3 · Tele-AI/TeleFuser

yJader · 2026-06-29T10:21:43Z

Description

This PR adds finer-grained LingBot-World-Fast profiling controls and improves chunk generation performance by reducing avoidable layout conversions, removing eager Triton LayerNorm wrapper overhead, and avoiding CUDA scalar index synchronizations in the self-attention KV cache. It also exposes local attention window settings through the pipeline config so the runtime can size and use the self-KV cache according to the requested window.

Motivation

Profiling showed several costs that were either hard to attribute or avoidable in the current LingBot-World-Fast runtime:

- Torch profiler defaults (record_shapes, profile_memory, with_stack) can add high overhead and distort pipeline-level traces.
- VAE CausalConv3d and DiT patch embedding receive NCDHW inputs while cuDNN Conv3d kernels prefer channels_last_3d; on current PyTorch/cuDNN this does not fall back to slow_conv_dilated3d, but it still triggers repeated implicit NCHW/NHWC layout transforms.
- The eager Triton LayerNorm path spends most of its apparent profiler time in Python/HOP/autotuner wrapper overhead rather than GPU compute.
- KV cache index tensors require .item() reads from CUDA tensors, introducing device-to-host synchronization.
- Local attention and sink-size settings need to flow from pipeline config into DiT/runtime cache sizing for profiling and generation experiments.

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Performance improvement
Code refactoring
Documentation update
Other (please describe):

Changes Made

Profiling controls and trace ranges
- Add env-driven torch profiler options:
  - TELEFUSER_TORCH_PROFILER_RECORD_SHAPES
  - TELEFUSER_TORCH_PROFILER_PROFILE_MEMORY
  - TELEFUSER_TORCH_PROFILER_WITH_STACK
- Keep historical profiler defaults when env vars are unset or invalid.
- Add ProfilingContext4Debug ranges for workloop, create_runtime, generate_next_chunk, denoise_chunk, kv_cache_update_forward, and vae_decode.
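
The env-driven flags above can be read with a small parser that falls back to the default when a variable is unset or unparseable. This is a minimal sketch, not the PR's code: the helper names `env_flag` and `profiler_kwargs`, the accepted truthy/falsy spellings, and the assumption that the historical defaults are all `True` are hypothetical.

```python
import os

# Accepted spellings are an assumption; the PR does not list them.
_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def env_flag(name: str, default: bool) -> bool:
    """Return the boolean value of an env var, or `default` if unset/invalid."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    return default  # invalid value: keep the historical default


def profiler_kwargs(defaults=None) -> dict:
    """Build keyword arguments for torch.profiler.profile from the env."""
    # Assumed historical defaults (hypothetical: all three enabled).
    defaults = defaults or {
        "record_shapes": True,
        "profile_memory": True,
        "with_stack": True,
    }
    return {
        key: env_flag(f"TELEFUSER_TORCH_PROFILER_{key.upper()}", value)
        for key, value in defaults.items()
    }
```

The returned dict could then be splatted into `torch.profiler.profile(**profiler_kwargs())`, since `record_shapes`, `profile_memory`, and `with_stack` are keyword arguments of that API.
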
LingBot-World-Fast local attention configuration
- Add local_attn_size and sink_size to LingBotWorldFastPipelineConfig.
- Pass those options into LingBotWorldFastDiT.from_pretrained.
- Size the self-KV cache from the local attention window when local_attn_size > -1.
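
As a rough illustration of the sizing rule above, the sketch below picks the self-KV cache length from the local window when one is requested. The function name, the `tokens_per_frame` and `max_frames` parameters, and the reading of `local_attn_size` as a count of latent frames (with `-1` meaning global attention) are assumptions for illustration; the sketch also does not model `sink_size`.

```python
def self_kv_cache_tokens(local_attn_size: int, tokens_per_frame: int, max_frames: int) -> int:
    """Number of token slots to allocate in the self-attention KV cache.

    Hypothetical helper: assumes local_attn_size is measured in latent frames
    and that -1 requests a cache covering the full sequence.
    """
    if local_attn_size > -1:
        # Local attention: the cache only needs to hold the window.
        return local_attn_size * tokens_per_frame
    # Global attention: the cache must hold every frame.
    return max_frames * tokens_per_frame
```

With a window of 21 frames, the cache length no longer grows with the number of generated frames, which is what makes the setting useful for long-running generation experiments.
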
Conv3d layout optimization
- Convert DiT patch embedding input to torch.channels_last_3d before self.patch_embedding.
- Convert WanVideoVAE.CausalConv3d input to torch.channels_last_3d before nn.Conv3d.forward.
- The gain does not come from avoiding slow_conv_dilated3d: with torch 2.12.1+cu130 and cuDNN 9.20, the baseline already uses cuDNN Conv3d. It comes from trading one explicit DtoD layout copy for far fewer implicit cuDNN NCHW/NHWC transforms.
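
The layout conversion described above amounts to one explicit copy before the convolution. The following is a minimal sketch on a toy `nn.Conv3d` (not the actual `CausalConv3d` or patch embedding code, and with made-up tensor sizes), showing that the logical shape and the result are unchanged while the memory order becomes NDHWC:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the real conv layers; sizes are illustrative only.
conv = nn.Conv3d(in_channels=4, out_channels=8, kernel_size=3, padding=1)
x = torch.randn(2, 4, 3, 8, 8)  # NCDHW, default contiguous layout

# One explicit copy into channels-last memory order; the shape stays
# (N, C, D, H, W), only the strides change.
x_cl = x.contiguous(memory_format=torch.channels_last_3d)

with torch.no_grad():
    out_ref = conv(x)    # input in the default layout
    out_cl = conv(x_cl)  # input already in channels_last_3d
```

On the GPU, handing cuDNN an input that is already in its preferred layout is what lets it skip the per-call implicit transforms; the numerical result is the same up to floating-point rounding.
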
LayerNorm eager path
- Route LayerNorm.forward_cuda to the native PyTorch implementation in eager mode.
- This removes the profiler false hotspot caused by torch.library.wrap_triton / HOP / Triton autotuner wrapper cost on small LayerNorm kernels.
KV cache index synchronization
- Store global_end_index and local_end_index as host Python int values instead of CUDA tensors.
- Keep a compatibility helper for existing tensor values, but update the runtime path to write ints.
- This avoids .item()-driven DtoH syncs in CausalSelfAttention.forward.
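
A minimal sketch of the host-side index idea follows. The field names `global_end_index` and `local_end_index` come from the PR; the helper `as_host_int`, the `SelfKVCacheIndices` class, and the `advance` update rule are hypothetical and only illustrate keeping the indices as Python ints while still tolerating legacy tensor values.

```python
from dataclasses import dataclass
from typing import Any


def as_host_int(value: Any) -> int:
    """Compatibility helper: accept a Python int or a 0-dim tensor-like value."""
    if isinstance(value, int):
        return value  # already on the host, no synchronization
    # Legacy path: .item() forces a device-to-host sync if the value is on CUDA.
    return int(value.item())


@dataclass
class SelfKVCacheIndices:
    # Stored as plain ints so reading them never touches the GPU.
    global_end_index: int = 0
    local_end_index: int = 0

    def advance(self, num_new_tokens: int, cache_capacity: int) -> None:
        """Hypothetical update rule: the local index saturates at capacity."""
        self.global_end_index += num_new_tokens
        self.local_end_index = min(self.local_end_index + num_new_tokens, cache_capacity)
```

Because the indices are only ever used for slicing and bookkeeping, holding them on the host lets the attention forward pass enqueue kernels without waiting for the device.
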
Packaging/import robustness
- Add a fallback __version__ = "0.0.0+unknown" when telefuser._version is absent in a source checkout.
Tests added
- tests/unit/utils/test_profiler_flags.py

Testing

Targeted unit tests pass
Manual testing performed
Benchmarks added/updated (if applicable)

python -m pytest -q tests/unit/utils/test_profiler_flags.py

Result: 2 tests passed.

Checklist

Code follows the project's coding standards (ruff)
Pre-commit hooks pass (pre-commit run --all-files)
All tests pass (pytest tests/)
New tests added for new functionality
Documentation updated (README, CLAUDE.md, docstrings)
Commit messages are clear and descriptive
PR title follows the convention: [TYPE] Brief description

Related Issues

N/A

Additional Notes

Earlier analysis identified an old-environment Conv3d issue where PyTorch 2.9.1 + cuDNN 9.10 could route bf16/fp16 5D Conv3d to aten::slow_conv_dilated3d. The current benchmark environment is different: torch 2.12.1+cu130, CUDA 13.0, cuDNN 9.20, H100. In this environment, the baseline no longer goes through slow_conv_dilated3d; the observed VAE win is from reducing implicit layout transforms.

GPU Architecture Support

SM80 (Ampere, Ada Lovelace)
SM90 (Hopper H100)
SM100+ (Blackwell)

No new custom CUDA/Triton kernels are added. Runtime measurements in this draft were collected on NVIDIA H100 GPUs. The code changes use PyTorch memory-format and native operator paths, so there is no new architecture-specific kernel support matrix to validate.

Performance Impact

Primary no-profiler benchmark:

Config:
- case 03
- frame_num=201
- chunk_size=3
- local_attn_size=21
- sink_size=3
- max_area=399360
- --no-write-video
- CUDA timing sync enabled
- summary skips 1 warmup chunk and reports 16 steady-state chunks
Environment:
- torch 2.12.1+cu130
- CUDA 13.0
- cuDNN 9.20
- NVIDIA H100

| Metric | Baseline | Modified | Delta | Relative |
| --- | --- | --- | --- | --- |
| `generate_next_chunk_seconds.mean` | 2.932498 s | 2.862083 s | -0.070415 s/chunk | -2.40% |
| `denoise_seconds.mean` | 2.043713 s | 2.036162 s | -0.007551 s/chunk | -0.37% |
| `update_cache_seconds.mean` | 0.500482 s | 0.498508 s | -0.001974 s/chunk | -0.39% |
| `decode_seconds.mean` | 0.388245 s | 0.327375 s | -0.060869 s/chunk | -15.68% |
| `total_seconds.mean` | 2.932894 s | 2.862172 s | -0.070722 s/chunk | -2.41% |

At this resolution/config, decode accounts for roughly 86% of the steady-state generate_next_chunk improvement.
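
The 86% figure can be reproduced from the benchmark means reported for this configuration. The snippet below is only a worked check of that arithmetic; the dictionary keys are shorthand for the metric names.

```python
# Steady-state means in seconds per chunk, copied from the benchmark above.
baseline = {
    "generate_next_chunk": 2.932498,
    "denoise": 2.043713,
    "update_cache": 0.500482,
    "decode": 0.388245,
}
modified = {
    "generate_next_chunk": 2.862083,
    "denoise": 2.036162,
    "update_cache": 0.498508,
    "decode": 0.327375,
}

# Negative delta means the modified branch is faster.
delta = {name: modified[name] - baseline[name] for name in baseline}

# Fraction of the generate_next_chunk improvement explained by decode.
decode_share = delta["decode"] / delta["generate_next_chunk"]
decode_share_percent = round(decode_share * 100)

print(f"decode share of improvement: {decode_share_percent}%")
```
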

Profiler trace attribution:

Config:
- case 03
- frame_num=89
- chunk_size=3
- local_attn_size=21
- sink_size=3
- max_area=99840
- profiler enabled for create_runtime,generate_next_chunk
Important caveat: profiler-enabled total_seconds is dominated by profiler overhead. Use stage timing and GPU kernel attribution, not end-to-end profiler wall time.

Profiler timing summary:

| Metric | Baseline | Modified | Delta | Relative |
| --- | --- | --- | --- | --- |
| `generate_next_chunk_seconds.mean` | 1.024707 s | 0.625288 s | -0.399418 s/chunk | -38.98% |
| `denoise_seconds.mean` | 0.745963 s | 0.436910 s | -0.309053 s/chunk | -41.43% |
| `update_cache_seconds.mean` | 0.165109 s | 0.094973 s | -0.070136 s/chunk | -42.48% |
| `decode_seconds.mean` | 0.108267 s | 0.090073 s | -0.018194 s/chunk | -16.80% |

DiT GPU attribution from analyze_telefuser_dit_profile.py:

| Trace | Chosen GPU pid | Raw GPU time | Clean GPU time |
| --- | --- | --- | --- |
| Baseline | 1 | 460.566 ms | 448.443 ms |
| Modified | 1 | 472.059 ms | 459.920 ms |

DiT GPU kernel time is slightly higher in the modified trace (about +11.5 ms, roughly +2.5%), so it is not the source of the speedup in this trace.

VAE/GPU0 copy-layout attribution:

| Metric | Baseline | Modified | Delta |
| --- | --- | --- | --- |
| GPU0 total kernel time | 86.184 ms | 74.733 ms | -11.451 ms (-13.29%) |
| layout/copy family, excluding host copies | ~27.16 ms | ~16.36 ms | ~-10.8 ms |
| `torch_direct_copy_kernel` | 14.641 ms / 434 launches | 8.782 ms / 263 launches | -5.859 ms |
| `cudnn_nchw_to_nhwc` | 7.561 ms / 207 launches | 0.351 ms / 12 launches | -7.210 ms |
| `cudnn_nhwc_to_nchw` | 3.392 ms / 105 launches | 0.158 ms / 6 launches | -3.234 ms |
| `Memcpy DtoD` | 0.447 ms / 70 launches | 5.950 ms / 350 launches | +5.503 ms |

Interpretation: the explicit x.contiguous(memory_format=torch.channels_last_3d) increases visible Memcpy DtoD, but it removes more implicit cuDNN NCHW/NHWC transforms and direct-copy work. This is why the modified branch can show higher Memcpy DtoD while still reducing total VAE decode time.
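
The trade-off can be checked by summing the four kernel families listed in the attribution table. This is a worked check of the reported numbers only; the tuple layout is an illustrative choice.

```python
# (baseline_ms, modified_ms, baseline_launches, modified_launches)
kernels = {
    "torch_direct_copy_kernel": (14.641, 8.782, 434, 263),
    "cudnn_nchw_to_nhwc": (7.561, 0.351, 207, 12),
    "cudnn_nhwc_to_nchw": (3.392, 0.158, 105, 6),
    "Memcpy DtoD": (0.447, 5.950, 70, 350),
}

# Net change in GPU time: the extra explicit DtoD copies are more than
# offset by the implicit transforms and direct copies that disappear.
net_ms = sum(mod - base for base, mod, _, _ in kernels.values())

# Net change in kernel launches across the same four families.
net_launches = sum(mod_n - base_n for _, _, base_n, mod_n in kernels.values())

print(f"net layout/copy time: {net_ms:+.3f} ms, launches: {net_launches:+d}")
```

The per-kernel deltas sum to the roughly -10.8 ms reported for the layout/copy family, even though `Memcpy DtoD` on its own grows by about 5.5 ms.
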

Supplemental Trace Figures

The following profiler screenshots are included as supplementary evidence. The numeric benchmark tables above remain the source of truth for this PR.

Historical VAE Conv3d trace from the earlier PyTorch/cuDNN environment. This explains why the channels_last_3d change was originally investigated. It should not be read as the current torch 2.12.1+cu130 behavior, where baseline no longer falls back to slow_conv_dilated3d.

Profiler-side generate_next_chunk / VAE timing view, showing the stage-level before/after context that motivated the VAE decode attribution.

LayerNorm profiler hotspot. The linked GPU kernel is only tens of microseconds, while the visible range is dominated by eager wrap_triton / HOP / autotuner wrapper cost.

LayerNorm fix context: route eager execution to the native PyTorch implementation instead of the small Triton wrapper path.

KV cache index trace showing DtoH synchronization from reading CUDA scalar index tensors with .item(). The PR changes those indices to Python int values in the runtime path.

lzx1413 · 2026-06-30T08:31:35Z

LGTM

yJader added 8 commits June 23, 2026 07:53

- 285bb41 feat(pipeline/lingbot): add profiling context for performance monitoring in runtime and service
- a122ea3 feat(pipeline): add local attention and sink size configuration options
- aba8f1c feat(utils/profiler): Expose torch.profiler flags by environment variables
- 27ec05a chore(gitignore): ignore uv lock file
- cc2e5f0 fix(package): handle missing generated version
- 2d9c91c perf(ops): use native layernorm in eager mode
- 161d1d5 perf(models): use channels-last 3d for conv inputs
- c6a4024 perf(lingbot): keep kv cache indices on host

lzx1413 merged commit 84e3f10 into Tele-AI:main Jun 30, 2026
5 checks passed

lzx1413 merged 8 commits into Tele-AI:main from yJader:feature/lingbot-world-fast-profiling
